Biobjective gradient descent for feature selection on high dimension, low sample size data

Even though deep learning shows impressive results in several applications, its use on problems with High Dimensions and Low Sample Size, such as diagnosing rare diseases, leads to overfitting. One solution often proposed is feature selection. In deep learning, along with feature selection, network sparsification is also used to improve the results when dealing with high dimensions low sample size data. However, most of the time, they are tackled as separate problems. This paper proposes a new approach that integrates feature selection, based on sparsification, into the training process of a deep neural network. This approach uses a constrained biobjective gradient descent method. It provides a set of Pareto optimal neural networks that make a trade-off between network sparsity and model accuracy. Results on both artificial and real datasets show that using a constrained biobjective gradient descent increases the network sparsity without degrading the classification performances. With the proposed approach, on an artificial dataset, the feature selection score reached 0.97 with a sparsity score of 0.92 with an accuracy of 0.9. For the same accuracy, none of the other methods reached a feature score above 0.20 and sparsity score of 0.35. Finally, statistical tests validate the results obtained on all datasets.


Introduction
Deep learning has gained popularity in various fields, especially when a large amount of data is accessible.However, in some research fields, the available data falls in the High Dimensions, Low Sample Size category (HDLSS).For example, in bioinformatics, predicting phenotypes from genetic variants is challenging due to high dimensionality and small number of patients.Similar problems are found in other areas, such as chemistry, financial engineering, and computer vision.When dealing with HDLSS data, most machine learning algorithms face severe overfitting and high-variance gradients [1].Feature selection is among the solutions proposed to overcome these problems.Feature selection in deep learning can be considered as a particular type of architecture pruning, where neurons are eliminated from the input layer rather than the hidden layers [2].This approach is a form of sparsity, where unneeded weights are set to zero during training.Recent research has focused on promoting sparsity in deep neural networks to improve their efficiency.Sparsity-based feature selection has proven to be effective in handling HDLSS data.Various techniques have been proposed, including dropout-based methods [3,4] and methods that add a regularization term to the network's loss function [5,6].
Either limited sparsity or regularization term in the objective function is used to promote global network sparsity.In addition to feature selection, sparsity should be sought to reduce a network's complexity and running time.Methods based on regularization use a single scalarized objective function consisting of a loss function that computes a network's error and a regularizer that reduces the network's weights and increases its sparsity.This approach has several disadvantages.First, the regularizer is controlled by a hyperparameter whose value may be difficult to set.Second, for a fixed hyperparameter, minimizing such a scalarized objective function gives only one solution.However, when dealing with conflicting objectives, such as loss and sparsity, it is generally not possible to find one optimal solution for both objectives at the same time, and one looks instead for trade-off solutions, the so-called Pareto set.Lastly, by varying the hyperparameter value, these methods can get points from the Pareto set only when the Pareto front is convex [7].Sener and Koltun [8] proposed a multiobjective approach for multitask learning to get a single Pareto optimal solution with a good trade-off among different tasks.Based on this work, Lin et al. [9] developed a new Pareto multitask learning algorithm (ParetoMTL) to get a set of well-distributed Pareto optimal solutions.While using multiobjective optimization to perform feature selection can improve the classification results, it has been limited to evolutionary and metaheuristics algorithms [10,11].These algorithms were not tested on HDLSS data and do not include the feature selection process into the learning phase.Additionally, most proposed algorithms, whether based on single objective optimization or multiobjective, focus on reducing the number of features without looking at which features were selected.This can be seen as a disadvantage as explainable models are increasingly demanded.To overcome this gap, this paper proposes a new Biobjective Feature Selection method (BFS) designed for HDLSS data that performs feature selection based on network sparsification.The approach is based on a constrained bicriteria gradient descent inspired from [9] with some improvements.The proposed method takes into account the large difference of scales on the two objective functions and improves the computational complexity of the algorithm.
The method was tested on HDLSS classification problems on both artificial and real datasets.It was shown that the method can perform better in terms of feature selection and sparsity while maintaining a high accuracy.The main contributions of this work are: • Proposing a new approach feature selection based on network sparsification for HDLSS data based on biobjective gradient descent.
• Developing a novel formulation of the constrained biobjective optimization problem with fewer constraints compared to the literature.By doing so, this method can be applied regardless of the number of regions in the objective space.
• Introducing a normalization step of the objective functions to ensure an objective function homogeneity requirement.
• Evaluating the method on HDLSS classification problems, both artificial and real.
This article is organized as follows.First, a summary of current works found in the literature is given.Then, details about the proposed method are presented followed by the obtained results and their discussion.Lastly, a conclusion, including a few perspectives and some limitations of the current approach, is presented.

Related works
As mentioned before, there are different feature selection approaches.Some of these approaches are based on neural networks.A summary of some state-of-the-art methods is given in this section.The focus is on methods embedding Feature Selection (FS) in the learning step of the algorithm.In addition, since this work is based on a biobjective gradient descent optimization algorithm, some popular multiobjective based approaches used in the field of neural networks are mentioned.

Feature selection
While most feature selection methods based on deep neural networks require a large number of training data and fail on HDLSS data, there exist few methods specifically tailored to overcome this problem.Deep Neural Pursuit (DNP) [4] is an MLP that uses multiple dropouts to learn important features.GRACES [12] is a method inspired by DNP.Still, it is based on graph convolution networks and uses latent relations between the samples in addition to multiple dropouts to improve feature selection.In [13], the authors propose DeepFS, a method combining autoencoders and feature extraction to select the relevant features.Other methods are based on stochastic gates.For example, in [14], the authors add stochastic gates to the input layer to perform feature selection: an open gate means the feature is important.Similarly, in [15], the authors use stochastic gates to implement what they call a locally sparse network.Their method can detect the important features that are specific to each sample.In [6], the authors add a FS layer that satisfies two constraints in order to select the relevant features.However, while this method gives a sparse input layer with only relevant features, it only works with a specific MLP architecture.
Other methods use regularization to perform feature selection based on sparsity.The most popular example of regularization is the lasso (l 1 ) [16].Other variants include sparse group lasso [5], and using both l 1 and l 1/2 [17].In DNN, applying l 1 regularization (lasso) consists of shrinking the network's connections weights' towards zero.However, as each neuron has many connections, the probability of removing a neuron from a network is very low.Therefore, new methods extending the lasso and considering this problem emerged.For instance, group lasso [18], defined as the (2,1)-norm, consists of shrinking to 0 a group of connections simultaneously rather than one at a time.The group of connections is usually made of all outgoing connections of one neuron, and eliminating it means canceling the neuron.Sparse group lasso [19] was proposed as a method to take advantage of the sparsity induced by both group lasso and lasso.
The regularization can be applied only to the input layer or to the whole network.When applied to the input layer, a lasso [16] or a group lasso [18] regularizer shows relevant feature selection results.When applied to the whole network, in addition to input feature selection, high-level feature selection is performed.This can lead to better results when dealing with HDLSS problems and can improve network interpretation.
In [5], the authors show how using group lasso and specifically sparse group lasso is effective for creating sparse networks.Yoon and Hwang [20] combine this group sparsity with an exclusive sparsity based on (1,2)-norm to introduce diversity into the selected weights.

Multiobjective optimization
One aim of multiobjective optimization is to find the Pareto set that is constituted of Pareto optimal solutions, i.e. non-dominated solutions (see S1 Appendix for definitions).Pareto optimization approaches became more widely used in the field of machine learning in general and deep learning specifically [21] to overcome the disadvantages of the scalarization approach.
There exist different types of algorithms to solve multiobjective optimization problems.One can cite methods based on Evolutionary Algorithms (EA) and methods based on gradient descent.Evolutionary Algorithms (EA) combining genetic algorithms and neural networks [22,23] are the most popular approach, and many variants have been proposed in recent works [24][25][26][27].However, because of the nature of these algorithms, learning relevant features is not done during the training step of the classifier.In addition, EA methods have been shown to be time-consuming for FS tasks [24].
As an alternative, metaheuristics based algorithms, are therefore a popular choice for FS problems.In [28], the authors introduced a FS method that uses a Particle Swarm Optimization (PSO) and a modified Firefly Algorithm (FA) inspired by quantum computing to find optimal Convolutional Neural Networks (CNNs).Optimization of their neural networks is based on two objective functions, the model's accuracy and the ratio between the number of selected features by the model and the overall number of features.In [29], the authors propose to combine Brain Storm Optimization, a metaheuristics algorithm to FA in order to select features.They use K-Nearest Neighbors, a supervised learning algorithm, for the classification task.In [11], the authors opted for a modified version of the FA to propose a hybrid two-level framework to achieve FS and hyperprameter tuning of an eXtreme Gradient Boosting (XGBoost) machine learning model.Similarly, in [10], the authors propose a two-level framework based on a novel diversity oriented social network search metaheuristics algorithm.The algorithm is used to achieve FS and machine learning tuning.However, in both methods, the FS is done in the first level and is fed as an input to the second level of the framework.In other works, a metaheuristic algorithm is combined with a Genetic Algorithm (GA).For instance, in [30], the authors combine Assimilated Artificial Fish Swarm Optimization (AAFSO) and GA with a CNN.AAFSO is used for feature selection and the GA for choosing the hyperparameters of the CNN.In [31], the authors use a qualitative approximation approach to adapt both GA and PSO to get models that converge faster.This makes it possible to test such algorithms on large datasets.Wu et al. [32] introduced a federated deep fuzzy neural network that uses PSO to optimize the network's architecture based on four objective functions for Epistatic Detection.The fuzzy logic makes it easier to interpret the neural network's results and thus to detect the important genes causing a disease.However, these methods, just like EA methods, require the feature selection step to be evaluated separately from the training step.In addition, none of these methods have been specifically designed for HDLSS data.
Gradient-based methods can therefore be more interesting for FS tasks.In practice, finding Pareto optimal solutions is difficult, and several works define necessary conditions for Pareto optimality.The most used are Pareto criticality [33] and Pareto stationarity [34].Fliege and Svaiter [33] proposed a generalization of the single objective steepest descent algorithm based on Pareto criticality, while Desideri [34] proposed another approach based on Pareto stationarity (See S1 Appendix for more details).While reported in the literature, the equivalence between Pareto stationarity and Pareto criticality definitions was never formally proven before, and sometimes the terms are used interchangeably [35].Another observation that was reported in the literature is the strong link between the two optimization problems defined by Fliege and Svaiter and Desideri via the duality theory [33].However, no detailed proof was provided.A formal proof of the equivalence between the two definitions is presented in S1 Appendix.Furthermore, a formal proof demonstrating the equality of the two directions computed by the Desideri approach and the Fliege and Svaiter approach, is provided in S1 Appendix.
In recent years, multiobjective gradient descent algorithms have been increasingly used in DNNs [36].So far, these algorithms have been mostly used for multi-task learning [8,9,37] and not for FS for HDLSS data.In [8], the authors aim to obtain one solution with a trade-off for both tasks.In [9], Lin et al. developed ParetoMTL, a constrained multiobjective gradient descent-based method, to generate multiple networks with different trade-offs between the tasks.However, ParetoMTL has some limitations, particularly in its inability to generate dispersed networks when tasks have varying levels of difficulty.In [37] the authors developed multiple algorithms to obtain Exact Pareto Solution (EPO) for multi task problems and multicriteria decision making.Others base their work on hypernetworks first introduced by [38] in order to find a continuous Pareto front.In [39], the authors use multi-sample hypernetworks and hypervolume indicator maximization for multi task learning.However, none of these methods have been tested for feature selection purposes.

General notations
Given a training set of N inputs X = {X 1 , . .., X N }, (X i 2 R p ; 1 � i � N), with p the number of features such as p > >n, and their corresponding labels Let L be the layers of a neural network, with l = 0 (resp.l = L) the input (resp.output) layer.For h l j , the j th neuron of the layer l (1 � l � L), its activation value a l j is defined by a l j ¼ �ðz l j Þ where: with H l the size of the layer l, w l ij the weight of the connection from neuron h lÀ 1 i to neuron h l j , b l j the bias of h l j , ϕ its activation function and a 0 i the i th input feature.For the hidden layers (1 � l � L − 1), the used activation function is the rectified linear unit function (ReLU) defined as ϕ (z) = max(0, z).As for the output layer L, the used activation function is the softmax function � : R K !ð0; 1Þ K defined as: The values ϕ(z L ) k represent the posterior probabilities, estimated by the network, that a given input belongs to a class c k , for 1 � k � K.
The neural network is represented by its set of parameters Þ and the bias b l i of the neuron h l i .For an input X i , the output of the neural network is the vector: where

Choice of the objective functions
As stated before, feature selection is performed through network sparsification.Suppose all the outgoing connections from an input neuron are close to zero, whereas the accuracy of the model is still high.In that case, this input feature is irrelevant to the classification task.Moreover, since the sparsity of the networks is also a desirable property, sparsity is enforced beyond the input layer, i.e., also in the hidden layers.Beyond a certain level of sparsity, the accuracy will decrease.So, the goal of this method is to balance the network's accuracy and its sparsity, which is achieved through a biobjective optimization approach.In the sequel, experiments are run on Multi-Layer Perceptrons (MLP) neural network architecture, but the method can also be adapted to other types of architectures.Two objective functions are used in BFS.The first objective, denoted as f 1 , is the crossentropy: where Y ik and Ỹ ik the k-th component of the class vector Y i and predicted vector Ỹi respectively.The second objective is the sparse group lasso regularization term [19] that allows the removal of individual connections and neurons from the network.The idea is to consider sparsity at the level of groups of connections to force all outgoing connections from a single neuron to be simultaneously set to zero.More precisely, the aim is to consider two non-overlapping groups of connections that correspond to the sparsity groups:

Groups representing the input layer G input
An element g 2 G input is a vector that represents the weights of all outgoing connections from a neuron in the input layer:

Groups representing the hidden layers G hidden
An element g 2 G hidden is a vector that corresponds to the weights of all outgoing connections from a neuron in the hidden layers: The second objective, denoted as f 2 , is the sparse group lasso regularization term: where #g represents the dimension of the vector g and ||g|| 2 the Euclidean norm of y l i and jjy l i jj its L1 norm.Note that this objective encourages sparsity at both the individual and group levels.In the sequel of this section, the optimization approach used in this method is presented and then, the initialization and training steps are explained.

Optimization approach
This work proposes a new approach for feature selection based on architecture optimization, using a constrained multiobjective gradient descent by Lin et al. [9] with two major modifications to overcome two limitations of their proposed work.Two kinds of normalization are performed during the search process in order to take into account the large difference in scales between the two objective functions.Moreover, the regions are defined differently, using only 2 constraints instead of K.This leads to fewer constraints in the optimization problem, making the problem faster to solve, especially for large values of K.
Before running the algorithm, the number of regions K must be chosen and the objective space is divided accordingly into K different regions, denoted by þ jv:z k � 0 and v.z k+1 � 0} where z k and z k+1 are the normal vector starting from the origin with respect to the vectors delimiting the region as shown in Fig 1 .The biobjective problem with respect to the k-th region can now be written as follows: where f 1 and f 2 are the two objective functions defined above.In practice, as the two objective functions are of very different scales, normalization is required.
The proposed method called BFS for Biobjective Feature Selection consists of two steps: an initialization step that leads to dispersed initial solutions and a training step that provides a Pareto front.Initialization step.The first step is to initialize the networks.This step is very important as it allows initial solutions to be dispersed in the objective space.This leads to well-representative solutions.For this, the algorithm starts from a random solution θ, and for each 1 � k � K, gradient descent is performed to get a solution inside the region C k or at least as close to it as possible.Let I ∊ ðyÞ ¼ fjjG k j ðyÞ � À ε; j ¼ 1; 2g be the set of active constraints, where ε is a threshold added to avoid solutions to remain on the region boundaries.The optimal descent direction d 0 and the temporary variable γ 0 are computed by solving the following As can be seen in Fig 3, there can be at most one active constraint, i.e. #I � (θ) � 1: 1.If #I � (θ) = 0, the solution is in its region, and there is no need to pursue the gradient descent (Fig 3(A)).
2. If #I � (θ) = 1 the solution is not in its region, gradient descent is performed in order to minimize the degree of violation of the active constraint ( Training step.After the initial solution is reached, each network is trained using a constrained multiple-gradient descent algorithm.The optimal descent direction d* can be  obtained by solving the following problem: where γ* is a temporary variable, rf i (θ) the gradient of the objective function i and rG k j ðyÞ the gradient of the constraint j.However, because of the high number of parameters found in neural networks, it is more efficient to solve the following related dual problem (6).For more details on how to obtain the dual, please refer to S1 Appendix (Remark 3).
where λ i and β j are the Lagrange multipliers.The descent direction d* is then calculated as In this step, as shown in the diagram in Fig 4, the normalization consists of dividing each objective function by the absolute value of the objective function obtained at the end of the initialization for the network being trained.The algorithm stops when a number of epochs is reached or when there is no more improvement for 10 successive epochs.
The BFS method is summarized in Fig 4.

Experiments
BFS is compared to four different algorithms: L1-NN, SGL-NN, MOO-MTL, and BFS init only.L1-NN represents an MLP with a scalarized objective function that combines the crossentropy and the L1 regularizer.SGL-NN is also a mono-objective MLP with a scalarized objective function combining the cross-entropy and the Sparse Group Lasso (SGL) regularizer as in [5].MOO-MTL is an MLP based on an unconstrained multiobjective optimization as in [8].BFS init only combines BFS and MOO-MTL: the networks are initialized using the BFS initialization step and then the unconstrained multiobjective optimization from MOO-MTL is used for the training step.ParetoMTL comparison was omitted as it was impossible to generate results because of a high running time.See Section 1.3 in S2 Appendix for more details.
For each method, a combination of random search and cross-validation are used to identify the optimal set of hyperparameters.In order to get a large number of solutions, the objective space for BFS is split into 100 regions.For a fair comparison, each of the methods mentioned above was run 100 times using the best hyperparameters found.Details on the hyperparameter choices and their search range can be found in Section 1.2 in S2 Appendix.
To assess the quality of the obtained solutions in terms of classification, feature selection, and sparsity, the following metrics were used: • Classification metrics: used to evaluate the performance of a model.Four metrics will be used in this work.First, the classification accuracy, which is an indicator of how often the classifier is correct overall.Second, recall, which shows whether a model can find all objects of the target class.Then, precision, which indicates how often the model is correct when predicting the target class.Finally, the F1-score, which is the harmonic mean of the recall and precision scores, and serves to assess the overall performance of a model.
• FS score: to evaluate the feature selection capacity of a method.Let G imp be the set of outgoing connection weight vectors of the input neurons corresponding to the relevant features and G input the set of outgoing connection weight vectors of all input neurons.The FS score is defined similarly to the Informative Features Weight Ratio (IFWR) score, introduced in [14], as follows: The higher the FS score, the better the feature selection capacity of the method.
In addition, to get the importance and rank each feature, the norm-1 value of the outgoing weights for each input neuron is computed.In other words, for each feature g, where g 2 G input , the following score is calculated: The higher score(g) is, the more important the feature g is.• Sparsity score: All weights with an absolute value smaller than 10 −5 are considered to be equal to 0. Let θ null be the set of weights equal to 0 and θ the set of all weights in a network.The sparsity score S is defined as follows: S ¼ jy null j jyj : The higher the score, the more sparse the network.
Using the above described metrics, BFS is compared to the previously described state-ofthe-art methods on artificial and real HDLSS datasets.

Artificial datasets
AD1 and AD2, two artificial datasets of different difficulty levels, were considered.They were generated using the scikit-learn [40] package.Details about each dataset are provided in Table 1.Section 1.1 S2 Appendix, contains more information for reproducing the datasets.The AD1 dataset is a binary classification problem and is relatively easy since it contains only 2 classes.The AD2 dataset is harder as it is a multiclass classification problem with 3 classes, and the ratio of informative features to noise data is lower than in the AD1 dataset.In both datasets, the informative features follow a standard normal distribution, the rest of the features are just noise data following a uniform distribution.In order to simulate HDLSS data, the number of train instances was set to be lower than the total number of features in each dataset.Each algorithm was run 10 times.In each run, 100 neural networks were constructed.Table 2 summarizes the results obtained on the artificial datasets AD1 and AD2.The 5 methods are first compared by evaluating 6 metrics: the accuracy, precision, recall and F1-score as the classification metrics, the FS score as the feature selection metric and the S score as the sparsity score.The values reported in Table 2 correspond to the mean of the highest values obtained in each run.For the classification metrics, the network with the highest accuracy among the 100 network in each run, is used to calculate the mean of each metric.For the FS and S scores, the network with the highest FS score and highest S score respectively are used.In the case of AD1, BFS, MOO-MTL and SGL-NN all have similar high accuracy, precision, recall and F1 score, around 0.900.L1-NN and BFS init only have a lower accuracies around 0.820.The other classification metrics all have the same score as the accuracy.BFS has the highest FS and S sores with 0.970 and 0.916 respectively followed closely by BFS init only with a score of 0.907 for FS and 0.871 for the sparsity.MOO-MTL comes in third with low FS score and sparsity, 0.202 and 0.344 respectively.Finally, with both L1-NN and SGL-NN, neither feature selection or network sparsification occurs as for both methods the scores are low.As for AD2, BFS and BFS init only have the highest FS score, 0.998 and a very similar S score, 0.96 and 0.95 respectively.For BFS, all classification metrics are around 0.770.For BFS init only, all classification metrics are around 0.630, the lowest among all methods.BFS' accuracy is slightly lower than the highest accuracy obtained on this dataset which is 0.807.This accuracy is obtained by MOO-MTL.However, MOO-MTL performs poorly in terms of feature selection and sparsification, with a FS score of 0.229 and a S score of 0.471 meaning prediction was based on the wrong features.SGL-NN has the same accuracy as BFS, 0.763 but also performs poorly on feature selection and sparsity with 0.329 and 0.009 on the FS and S scores respectively.As for L1-NN, the models have a low accuracy, 0.687 and no feature selection (0.034) and no sparsity (0.000) at all.In addition to the different metrics,  low sparsity and few relevant features selected only.With MOO-MTL, the same problem occurs except that no solution reaches a high feature selection score which is in concordance with the results obtained on the 10 runs.As for SGL-NN and L1-NN, there is no Pareto front that can be seen as all solutions for both methods are gathered in one place, and no diversity in the solutions can be seen.diversity in the solutions like for AD1.In summary, replacing L1 with SGL in a mono objective network is not enough to enhance the ability of neural networks to select features.This study highlights the significance of multiobjective optimization in creating networks that are both accurate and sparse while also relying on the correct features.
Real datasets.For these experiments, two real text datasets were used: BASEHOCK and PCMAC, introduced in [41].The features in these datasets represent real and noise variables.In order to mimic HDLSS data, only 20% of the dataset was used for training, and the rest was used as a test dataset.Both datasets are used for feature selection purposes.BASEHOCK contains 2 classes, 4862 features and 1993 samples.The features are all sparse discrete variables with a majority of values being equal to 0. 994 samples belong to class 0 and 999 to class 1.As for PCMAC, the dataset is also composed of 2 classes with 1943 samples and 3289 sparse discrete features.The dataset is also balanced with 982 samples of class 0 and 961 of class 1.The relevant characteristics of the datasets used are summarized in Table 3. Similarly to artificial datasets, in addition to the classification metrics, the sparsity and feature selection capacity of each method are evaluated.
Each method was also run 10 times, and the reported metrics in Table 4 represent the mean for the best network per iteration of each metric for each method, similarly as for artificial datasets.In the case of real datasets, only the classification metrics and the S score are calculated as the relevant features are not known and the feature selection ability will be evaluated later.For BASEHOCK, all methods have a similar accuracy ranging from 0.898 for L1-NN and SGL-NN to 0.893 for M00-MTL and finally to 0.874 for BFS except for BFS init only who has an accuracy of 0.735.The other classification metrics are all the same as the accuracy for each method.However, the main difference between the methods lies in the S score.With a score of 0.97 and 0.98, respectively, BFS and BFS init only give the most sparse networks.MOO-MTL comes after with 0.848 sparse network.Finally, as in the case of the previous datasets, L1-NN and SGL-NN can not achieve any sparsification, which can be seen with the null score obtained.As for PCMAC, the results follow the same tendency, with a close accuracy between the different methods ranging from 0.791 for BFS to 0.815 for MOO-MTL and higher S score for BFS and BFS init only than the other methods.The lowest accuracy was once again for BFS init only The other classification metrics take the same values as the accuracy for each method.The Pareto front and diversity of the obtained solutions on 1 run are shown in Fig 6.For BASEHOCK, BFS is the only method with a complete Pareto front and diverse solutions.BFS init only has only 3 out of the 100 solutions with the rest of the solutions towards the bottom of the space as can be seen in Fig 6 .Similarly, for MOO-MTL, few solutions only can be differentiated from the rest of the solutions with a very low sparsity.SGL-NN and L1-NN have no diversity in the solutions as they are all stacked at the same position.PCMAC results follow the same pattern as BASEHOCK results which is also in concordance with the different calculated metrics.BFS has a high diversity between the solution in BFS.For BFS init only and MOO-MTL, only few solutions have a high sparsity with the others all gathered in the lower part of the objetcive space.As for SGL-NN and L1-NN, they have the same problem as in the other datasets: no sparsity occurs in networks trained with these methods.
When dealing with real datasets, relevant features are usually not known beforehand, unlike in artificial datasets.Therefore, the ability of a method to perform feature selection using the FS score introduced earlier is not possible.Hence, to assess feature selection capacity, each algorithm will be considered as a filter method to select a subset of features.Then, the selected features will be used as input to train a classification algorithm.If the selected features are relevant, the classification algorithm's accuracy will increase rapidly before stabilizing.In this protocol, the solution with the highest accuracy for each method and each dataset is first selected.The features are then ranked using the score described earlier using the norm-1 and the N most highly ranked features are selected.Afterwards, the selected features are used as input to three traditional classification algorithms: Random Forest (RF), Extreme Gradient Boosting (XGB), and Support Vector Classification (SVC).The mean test accuracy obtained with all three methods is then reported.This procedure is repeated by incrementing the number of chosen features N, retraining the three algorithms, and calculating the mean test accuracy again.The procedure is stopped when there is no longer any significant improvement in the test accuracy.The results are displayed in Fig 7. The focus in both datasets was on the first 500 features only since no significant accuracy improvement was observed beyond that with any method.For BASEHOCK, with 5 features used to train the algorithms, BFS init only starts with the highest accuracy, 0.7, followed by BFS, 0.65, and then L1-NN, MOO-MTL, and SGL-NN with a starting accuracy of 0.55 and below.All methods will converge to an accuracy around 0.85 after adding 500 features.However, when adding only the first 130 ranked features, BFS and BFS init are the only methods that reach it.300 additional selected features by L1-NN and SGL-NN need to be added to reach this same accuracy, and the first 500 selected features from MOO-MTL are needed to obtain this same accuracy.As for PCMAC, after adding 5 features, SGL-NN and MOO-MTL start with an accuracy of 0.65, L1-NN with an accuracy of 0.63, and BFS and BFS init only with an accuracy of 0.57.However, it then requires training with less than 100 features selected with BFS to reach an accuracy of 0.75.With BFS init only, around 120 features are needed to reach this same accuracy, and 170 features are needed with SGL-NN.MOO-MTL and L1-NN select the least relevant features as the first 370 selected features are needed with MOO-MTL and all 500 features selected with L1-NN to reach the same accuracy as with the previous methods.While the results show that BFS manages to get higher results in terms of feature selection and sparsity, a statistical analysis was conducted to check if there is a significant improvement and confirm the obtained findings.For each dataset, 10 values, the best from each iteration, was used.The tests were performed three times: once one the accuracy metric, once on the sparsity metric and once on the feature selection metric.In addition, in order to decide the test to use, the safe use of parametric tests conditions, that is independence, normality and homoscedasticity are tested.The independence condition is verified as for each method and dataset, the 10 run are independent.For the normality condition, the Shapiro-Wilk test was run and the obtained p-values are reported in Table 5 for all three metrics.The p-values are all smaller that α = 0.05, except for the FS score in the case of MOO-MTL, meaning that the normality condition is not satisfied therefore, the use of non parametric tests is more appropriate.
As there are only 5 methods and 10 runs for each methods, separate Wilcoxon tests were performed between the different methods for each dataset, with BFS as the control as suggested in [42].The obtained p-values for all metrics are presented in Tables 6-8 for the accuracy, S score and FS score respectively.Two methods are considered different on a metric and for a dataset if the corresponding p-value is less than α = 0.05.For the accuracy, SGL-NN and BFS as well as L1-NN have equal accuracies as the p-value is greater than 0.05 for all datasets.For BFS init only and MOO-MTL, the accuracies are different when compared to BFS as the pvalue is less than 0.05 for all datasets.As for the sparsity and feature selection, the difference between the values obtained with BFS and MOO-MTL, SGL-NN and L1-NN is significant as for all datasets, the p-value is less than 0.05 for both metrics.However, when compared to BFS init only, the hypothesis that the two methods are equal in terms of sparsity can not be rejected for the BASEHOCK dataset as the p-value is 0.084 which is greater than 0.05.Similarly, the hypothesis that BFS and BFS init only are equal in terms of feature selection can not be rejected as the p-values are 0.08 and 0.23 on AD1 and AD2 respectively.These tests show that BFS improves the sparsity and feature selection abilities of a network.As for BFS init only, while both sparsity and feature selection abilities are the same as in BFS, the accuracy is not and there is no real diversity in the obtained solutions.

Discussion
In this work, BFS, a novel constrained biobjective optimization method to perform feature selection and network sparsification has been presented.This approach proposes a new formulation for the constrained multiobjective optimization that is faster and easier to use.In fact, by limiting the number of constraints needed at every iteration of the gradient descent step, the running time of the algorithm is decreased and it is possible to split the objective space into many regions to diversify the solutions.The latter is a downside of the ParetoMTL method as it is impossible to run it on 100 regions (refer to Section 1.3 S2 Appendix for details).Using the dual method and the cvxopt solver, which takes less than a fraction of a second to run, BFS takes almost the same amount of time as a single objective neural network which is approximately 2 minutes for 1000 epochs.In addition, by integrating the sparse group lasso into the training step of the neural network, the need of an additional level to perform feature selection was eliminated, without adding an extra layer to the network.Moreover, the improvement in FS and sparsification was statistically validated in the case of HDLSS compared to the other methods.By comparing the proposed approach to L1-NN and SGL-NN, the impact of using an appropriate regularization function and multiobjective optimization was demonstrated.Using L1 is not adapted for feature selection in the case of HDLSS data.Furthermore, using an SGL regularizer in a scalarized single objective favors feature selection but not sparse networks.It is worth noting that single objective neural networks that use sparse group lasso do not reduce the weights sufficiently to create sparse networks, as it was shown on both artificial and real datasets.In addition, the normalization step added to deal with objective functions of different scales and the constraints increased the diversity of the solutions.MOO-MTL gave better accuracies in general but less diversity in the solutions.Without normalization, solutions tend to differ only on the objective with the highest gradient, which is sparsity in this case.
Although the predictions are highly accurate, they cannot be solely attributed to informative features.This is supported by the low FS score observed for artificial datasets and the inferior feature ranking for real datasets.This means that the models' predictions depend on noise or artifacts present in the data rather than on relevant features.In this work, the impact of having the constraints in both the initialization and training steps was also studied by comparing the proposed approach to BFS init only.It was observed that removing the constraints from the training step resulted in a reduction of the Pareto front coverage, leading to less diverse solutions.For instance, in the case AD2, only one solution reached good performance.A more complete and diverse Pareto front enables the user to select the most appropriate solution based on the application's requirements.

Conclusion
In this work, it was shown that sparse group lasso regularization is an efficient way to perform simultaneously feature selection and Deep Learning sparsification.The results obtained from experiments on various datasets show that the presented approach, based on a biobjective gradient descent, outperformed the other methods that use regularization approaches regarding selected relevant features.In addition to choosing the most relevant features, this method also has the advantage of generating sparse neural networks.It is worth noting that single objective neural networks that use sparse group lasso do not reduce the weights sufficiently to create sparse networks, as shown on both artificial and real datasets.Another advantage of this approach is that it provides several diverse non-dominated solutions allowing the user to choose the most suitable one according to the requirements of his application.For example, if he has very limited memory resources, he may prefer a very sparse solution instead of a very accurate one.
In future work, this method will be applied to mutation data as they are known to be of high dimensions and low sample size.In the health field and more specifically for precision medicine, being able to explain the results obtained or in this case, select the mutations on which to base the prediction is crucial.In addition, in this approach the problem of generating a discrete Pareto front that contains well-dispersed solutions was addressed.So far, very few works have addressed the problem of generating continuous Pareto front for multi-task learning and fairness problems based on hypernetworks [43].These approaches will be studied in the context of feature selection for HDLSS data.
However, BFS has a few limitations.First, sparse group lasso does not take into account redundant and correlated features.This problem can be addressed by using a different objective function rather than the sparse group lasso.In addition, the feature selection step is based on an FS score computed directly on the weights.While this method has been used frequently used in deep learning and gives reliable results, there exist other methods that can be more accurate such as LRP [44].This will be addressed in future work.Finally, as pointed out in [37] solutions will only stay in their regions if the number of regions is large.One way to overcome this, is by using Exact Pareto Optimal search which has proved to be efficient in some multitask learning in deep learning as proposed by the authors.This idea will be investigated in the case of feature selection.
Fig 2 illustrates the two steps.

Fig 1 .
Fig 1.The region C k ¼ fv ∈ R 2 þ jv:z k ≥ 0 and v:z kþ1 ≤ 0g.https://doi.org/10.1371/journal.pone.0305654.g001 Fig 3(B) and 3(C)).In this step, as shown in the diagram in Fig 4, normalization consists of dividing each objective function by the norm-2 of the vector composed of the objective functions.The termination criteria for this step are either the region is reached or 20% of the total number of iterations have passed.

Fig 4 .
Fig 4. Detailed workflow of the BFS algorithm.In the blue frame are shown the steps used during the initialization steps.Once the solution is in its region or the maximum number of iterations is reached, it will move to the red frame steps.The red frame corresponds to the training steps, which will keep running until the maximum epochs is reached or no more improvements are detected for 10 consecutive iterations.https://doi.org/10.1371/journal.pone.0305654.g004

Fig 5 ,
shows the Pareto front and solution dispersion for one run for each algorithm (100 networks).Fig 5(A) and 5(B) shows the sparsity score and the FS score respectively, along with the test accuracy, on AD1.BFS has the most complete front with the solutions being dispersed in the objective space.BFS init only manages to have very few solutions that are on the Pareto front with the rest all having

Fig 5 (
C) and 5(D) show the sparsity score and the FS score respectively, along with the test accuracy, on AD2 for all solutions returned by the 5 methods.Similarly to AD1, BFS is the method with the most complete Pareto front.BFS init only has a few solutions with high values in both feature selection and sparsification while MOO-MTL solutions are gathered, and the method obtains low S and FS scores.SGL-NN and L1-NN have no

Fig 6 .
Fig 6.Summary of the sparsity evaluation results on the real datasets.(A) shows the sparsity and test accuracy of the solutions obtained with the different methods on the BASEHOCK dataset.(B) shows the sparsity and test accuracy of the solutions obtained with the different methods on the PCMAC dataset.https://doi.org/10.1371/journal.pone.0305654.g006

Fig 7 (
A) shows the accuracy obtained when the number of selected features is varied on the BASEHOCK dataset.Fig 7(B) shows the accuracy obtained when the number of selected features is varied on the PCMAC dataset.Recall that the results shown are for one iteration only.

Fig 7 .
Fig 7. Summary of the feature selection evaluation results on the real datasets.(A) shows the mean test accuracy obtained when training the classification algorithms based on the number of features selected by the different methods to compare on the BASEHOCK dataset.(B) shows the mean test accuracy obtained when training the classification algorithms based on the number of features selected by the different methods to compare on the PCMAC dataset.https://doi.org/10.1371/journal.pone.0305654.g007

Table 2 . Table summarizing the obtained metrics on the artificial datasets.
Metrics shown are the averages computed from 10 runs.

Table 8 . Table with the p-values obtained with the different Wilcoxon tests performed on the feature selection metric.
https://doi.org/10.1371/journal.pone.0305654.t008